Data Analysis with Python - Course

Data Analysis

Data Analysis with Python Certification | freeCodeCamp.org
We're basically trying to transform data into information.

This is where Python and the PyData tools excel:

  • Gathering the data
  • Cleaning it
  • Transforming it for further analysis

This is where pandas excels:

  • Reading, cleaning, and transforming our data
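
A minimal sketch of the read, clean, transform flow with pandas. The data, the column names, and the choice to drop rows with a missing price are all made up for illustration; real cleaning steps depend on the dataset.

```python
import io
import pandas as pd

# Hypothetical raw data; column names and values are invented for this sketch.
raw_csv = io.StringIO(
    "name,price,quantity\n"
    "apple,1.5,4\n"
    "banana,,6\n"
    "cherry,3.0,2\n"
)

# Read: parse the CSV text into a DataFrame.
df = pd.read_csv(raw_csv)

# Clean: drop rows where the price is missing.
clean = df.dropna(subset=["price"])

# Transform: derive a new column for further analysis.
clean = clean.assign(total=clean["price"] * clean["quantity"])

print(clean)
```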

Modeling Data
Adapting real-life scenarios to information systems, using inferential statistics to see whether any pattern or model arises.

  • Use the statistical analysis features of pandas and visualizations from Matplotlib and seaborn.

You can work with data from Excel, CSV, and XML files and from APIs inside Jupyter Notebooks.


NumPy

Oftentimes, you will not be working directly with NumPy.
You will use pandas and Matplotlib, and they both work on top of NumPy.

Pure Python is not the right tool for heavy computation on large datasets.
NumPy fills that gap: it is a very efficient numeric processing library that sits on top of Python.

Multi-indexing in NumPy
The NumPy library needs to know the type of the objects you're storing.

NumPy stores numbers, dates, and Booleans, but not regular individual objects.
#myquestion What is a "regular individual object"?
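
One possible reading of "regular individual objects" is arbitrary Python objects such as dicts or class instances. A small sketch, under that assumption, of how NumPy assigns a specific `dtype` to numbers, Booleans, and dates, and falls back to the generic `object` dtype for everything else:

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.5, 3.0])
bools = np.array([True, False])
dates = np.array(["2024-01-01", "2024-01-02"], dtype="datetime64[D]")

# Arbitrary Python objects (here dicts) get the generic "object" dtype,
# so NumPy only holds references and loses its fast numeric storage.
objs = np.array([{"a": 1}, {"b": 2}])

for arr in (ints, floats, bools, dates, objs):
    print(arr.dtype)
```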

NumPy lets us create multidimensional arrays.

NumPy has a ton of attributes and functions to work with multidimensional arrays. For an array of two rows and three columns:

  • shape: the shape of the array, which is two rows by three columns
  • ndim: how many dimensions it has; with one vertical and one horizontal axis, it has two dimensions
  • size: the total size of the array; in this case, the total size is six
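
A quick sketch of those three attributes on a two-by-three array (the values in the array are arbitrary):

```python
import numpy as np

a = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

print(a.shape)  # (2, 3) -> two rows, three columns
print(a.ndim)   # 2      -> two dimensions
print(a.size)   # 6      -> six elements in total
```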

Selecting Elements in Multidimensional Arrays

import numpy as np

a = np.array([
    # 0  1  2     <- column index
    [1, 2, 3],  # row 0
    [4, 5, 6],  # row 1
    [7, 8, 9],  # row 2
])

Select with a[row][element], or equivalently a[row, element].
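
A short sketch of element selection on a three-by-three array. The slicing lines go slightly beyond the note above and are included only to show that the same bracket syntax extends to whole rows and columns.

```python
import numpy as np

a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

print(a[1][0])     # 4 -> row 1, element 0
print(a[1, 0])     # 4 -> same element, single indexing step
print(a[:, 2])     # [3 6 9] -> every row, column 2
print(a[:2, 1:])   # rows 0-1, columns 1-2
```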


NumPy Concepts
Vectorized operations and broadcasting

Vectorized operations are operations performed between arrays and arrays, or between arrays and scalars.
Basically, they allow you to perform operations on entire arrays at once.
So instead of using a for loop to scan through an array and add to each number,
a vectorized operation does the same work in a more efficient way.
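
A minimal sketch comparing a Python loop with the equivalent vectorized operations, plus one broadcasting case where a 1-D row is applied to every row of a 2-D array. The array values are arbitrary.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Loop version: element-by-element addition in pure Python.
looped = np.array([x + y for x, y in zip(a, b)])

# Vectorized versions: the operation applies to the whole array.
summed = a + b    # array with array
scaled = a * 10   # array with scalar
mask = a > 2      # comparisons are vectorized too

# Broadcasting: the 1-D row is stretched across both rows of m.
m = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
shifted = m + row

print(summed)   # [11 22 33 44]
print(scaled)   # [10 20 30 40]
print(mask)     # [False False  True  True]
print(shifted)  # [[11 22 33] [14 25 36]]
```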

NumPy is an immutable-first library: an operation you perform on an array will not modify it, but will return a new array.
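
A small sketch of that behavior. Note that augmented assignment (`+=`) is an exception to the default: it updates the array in place.

```python
import numpy as np

a = np.array([1, 2, 3])

# A regular operation returns a new array and leaves `a` untouched.
b = a + 10
print(a)  # [1 2 3]
print(b)  # [11 12 13]

# Augmented assignment modifies the existing array in place.
before = a.copy()
a += 10
print(a)  # [11 12 13]
```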


Visualizations using Matplotlib

Matplotlib has a global API and an object-oriented API for creating visualizations.
- Global API
  - simpler for quick and straightforward plotting tasks
  - manipulates the current figure and axes directly; the state is maintained globally, which can lead to confusion when dealing with multiple plots or figures
- Object-oriented API
  - more explicit and allows precise control over plots
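
A minimal sketch of the same kind of line plot in both styles. The `Agg` backend is chosen here only so the example runs without a display; in a Jupyter Notebook you would normally leave the backend alone.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for this sketch
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)

# Global API: pyplot functions act on the implicit "current" figure and axes.
plt.figure()
plt.plot(x, np.sin(x))
plt.title("Global API")
global_fig = plt.gcf()

# Object-oriented API: keep explicit handles to the figure and the axes.
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))
ax.set_title("Object-oriented API")
```

With explicit `fig` and `ax` handles, it stays clear which plot each call affects, even when several figures are open.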

Matplotlib supports various plot types:

  • line plots
  • scatter plots (can encode multiple dimensions using the color and size of data points)
  • histograms
  • bar plots
  • box plots (useful for visualizing data distribution and identifying outliers)
  • kernel density estimate (KDE) plots, which are similar to histograms for distribution estimation
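
A rough sketch of three of these plot types using the object-oriented API. The data is random and exists only for illustration. The scatter plot uses `c` and `s` to encode two extra dimensions as color and size.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for this sketch
import matplotlib.pyplot as plt
import numpy as np

# Random illustrative data; a fixed seed keeps the sketch reproducible.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)
colors = rng.uniform(size=100)
sizes = rng.uniform(10, 200, size=100)

fig, (ax_scatter, ax_hist, ax_box) = plt.subplots(1, 3, figsize=(12, 4))

# Scatter: position plus color and size gives four encoded dimensions.
sc = ax_scatter.scatter(x, y, c=colors, s=sizes, alpha=0.6)

# Histogram: distribution of x in 20 bins.
counts, bin_edges, _ = ax_hist.hist(x, bins=20)

# Box plot: median, quartiles, and points drawn as potential outliers.
box = ax_box.boxplot(x)
```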

Statistical analysis helps determine whether a value is valid or an outlier, depending on the context.


Reading Different Files into a pandas DataFrame

You can also read data from databases (SQL databases such as Postgres, etc.).
So basically, you can import Excel files, CSV files, database tables, and HTML tables into a pandas DataFrame.
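
A minimal sketch of two of these sources, using in-memory data so it runs anywhere. The table name `people` and its contents are made up. `pd.read_excel` and `pd.read_html` follow the same pattern but need optional dependencies, so they are left out here.

```python
import io
import sqlite3
import pandas as pd

# CSV: read from an in-memory text buffer instead of a file on disk.
csv_df = pd.read_csv(io.StringIO("id,name\n1,Ada\n2,Grace\n"))

# Database: write the rows to an in-memory SQLite table, then read them back.
conn = sqlite3.connect(":memory:")
csv_df.to_sql("people", conn, index=False)  # "people" is a hypothetical table name
sql_df = pd.read_sql("SELECT id, name FROM people ORDER BY id", conn)
conn.close()

print(sql_df)
```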

The .execute() method (for example, on a sqlite3 cursor) allows you to execute SQL queries against the database.
This is a fundamental step in interacting with databases programmatically.
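
A small sketch of `.execute()` using Python's built-in `sqlite3` module and an in-memory database. The `sales` table and its rows are invented for illustration; other database drivers expose a similar cursor interface, though details vary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create and populate a hypothetical table.
cur.execute("CREATE TABLE sales (item TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("apple", 1.5), ("banana", 2.0)],
)
conn.commit()

# Run a parameterized query and fetch the matching rows.
cur.execute("SELECT item, amount FROM sales WHERE amount > ?", (1.6,))
rows = cur.fetchall()
conn.close()

print(rows)  # [('banana', 2.0)]
```

Passing values through `?` placeholders rather than string formatting lets the driver handle quoting.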